Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
Solution 1.1: Speed, memory, and data types
There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.
Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage; all these readers can take gz file as input without explicit decompression.)
# Measure time and memory for read.csv (base R)time_read_csv_base <-system.time({ data_base <-read.csv("~/mimic/hosp/admissions.csv.gz")})memory_read_csv_base <-object_size(data_base)# Measure time and memory for read_csv (tidyverse)time_read_csv_tidyverse <-system.time({ data_tidyverse <-read_csv("~/mimic/hosp/admissions.csv.gz",show_col_types =FALSE)})memory_read_csv_tidyverse <-object_size(data_tidyverse)# Measure time and memory for fread (data.table)time_read_fread <-system.time({ data_fread <-fread("~/mimic/hosp/admissions.csv.gz")})memory_read_fread <-object_size(data_fread)
# Output the results for runtimecat("Runtime (in seconds):\n")
Runtime (in seconds):
cat("Base R read.csv: ", time_read_csv_base["elapsed"], " seconds\n")
fread from the data.table package is the fastest, with a runtime of approx 0.286 seconds.
Difference in Parsed Data Types:
read.csv (base R) converts character columns into factors by default.
read_csv (tidyverse) and fread (data.table) do not convert characters to factors by default, and they infer column types automatically.
Memory Usage:
fread (data.table) is the most memory-efficient, using approx 63,465,008 bytes (~63 MB).
read_csv (tidyverse) is the second most efficient, using approx 70,022,592 bytes (~70 MB).
read.csv (base R) uses the most memory at approx 200,098,832 bytes (~200 MB)
Solution 1.2: User-supplied data types
Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)
# Measure time and memory usage for read_csv without col_typestime_read_csv_without_coltypes <-system.time({ data_tidyverse <-read_csv("~/mimic/hosp/admissions.csv.gz",# Silent the Column specificationshow_col_types =FALSE)})memory_read_csv_without_coltypes <-object_size(data_tidyverse)
# Specifying column types based on data structurecol_types <-cols(subject_id =col_integer(),hadm_id =col_integer(),admittime =col_datetime(format =""),dischtime =col_datetime(format =""),deathtime =col_datetime(format =""),admission_type =col_character(),admit_provider_id =col_character(),admission_location =col_character(),discharge_location =col_character(),insurance =col_character(),language =col_character(),marital_status =col_character(),race =col_character(),edregtime =col_datetime(format =""),edouttime =col_datetime(format =""),hospital_expire_flag =col_logical())# Measure time and memory usage for read_csv with col_typestime_read_csv_coltypes <-system.time({ data_tidyverse_coltypes <-read_csv("~/mimic/hosp/admissions.csv.gz",col_types = col_types)})memory_read_csv_coltypes <-object_size(data_tidyverse_coltypes)
# Compare the resultscat("Runtime with col_types specified: ", time_read_csv_coltypes["elapsed"]," seconds\n")
Runtime with col_types specified: 0.521 seconds
cat("Memory usage with col_types specified: ", memory_read_csv_coltypes," bytes\n")
Memory usage with col_types specified: 63470560 bytes
cat("Runtime without col_types: ", time_read_csv_without_coltypes["elapsed"]," seconds\n")
Runtime without col_types: 0.568 seconds
cat("Memory usage without col_types: ", memory_read_csv_without_coltypes," bytes\n")
Memory usage without col_types: 70022592 bytes
Runtime change: Yes, the runtime with the col_types specified is slightly lower than the runtime without col_types. This indicates that specifying column types increased the loading speed by a minor difference.
Memory usage: The memory usage is 63,470,560 bytes with the col_types specified and 70,022,592 bytes with read_csv. The difference in memory usage is huge (6552032 bytes), meaning that specifying column types will optimize the memory instead of using default types where double takes more memory in case we have integer types.
Q2. Ingest big data files
Solution 2.1 Ingest labevents.csv.gz by read_csv
Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 3 minutes on your computer, then abort the program and report your findings.
-rw-r--r--@ 1 sakshihiteshoza staff 2592909134 Jan 24 15:14 /Users/sakshihiteshoza/mimic/hosp/labevents.csv.gz
cat("Memory of labevents.csv.gz: ", 2592909134*9.5367e-7, " MiB\n")
Memory of labevents.csv.gz: 2472.78 MiB
read_csv cannot ingest labevents.csv.gz because the available ram is ~284 MiB where as the compressed file labevents.csv.gz requires about 2473 MiB.
Solution 2.2 Ingest selected columns of labevents.csv.gz by read_csv
Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)
labevents_data <-read_csv("~/mimic/hosp/labevents.csv.gz",col_select =c(subject_id, itemid, charttime, valuenum))# Check the datahead(labevents_data)
It took more than 3 minutes
Reason for unable to load the data
zcat< ~/mimic/hosp/labevents.csv.gz |wc-l
158374765
subject_id_one_row =4# integer is 4 bytesitemid_one_row =4# integer is 4 bytesvaluenum_one_row =8# double is 8 bytescharttime_one_row =4# timestamp is 4 bytesone_row_total = subject_id_one_row + itemid_one_row + valuenum_one_rowone_row_total = one_row_total + charttime_one_row# Total rows from above cell's outputtotal_rows =158374765cat("Memory required with colselect labevents.csv.gz: ", one_row_total * total_rows *9.5367e-7, " MiB\n")
Memory required with colselect labevents.csv.gz: 3020.745 MiB
read_csv cannot ingest labevents.csv.gz with selected columns too because the available ram is ~284 MiB where as memory required requires about 3021 MiB.
Solution 2.3 Ingest a subset of labevents.csv.gz
Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.
In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: Use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. Do not put labevents_filtered.csv.gz in Git! To save render time, you can put #| eval: false at the beginning of this code chunk. TA will change it to #| eval: true before rendering your qmd file.)
Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file, excluding the header? How long does it take read_csv to ingest labevents_filtered.csv.gz?
It took about 16 seconds overall to load the file, out of which 9.741 seconds were spent running your R code (user time), and about 3 seconds were used by the system for tasks like file I/O. The remaining time could be due to overhead like waiting for resources, parallel processing inefficiencies, or other factors.
print(nrow(labevents_filtered))
[1] 407312
print(ncol(labevents_filtered))
[1] 4
Solution 2.4 Ingest labevents.csv by Apache Arrow
Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory (do not add it in git!). To save render time, put #| eval: false at the beginning of this code chunk. TA will change it to #| eval: true when rendering your qmd file.
Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.
system("gunzip -k ./labevents_filtered.csv.gz")
library(arrow)library(dplyr)# Measure time taken for ingestion, selection, and filtering using Apache Arrowtime_taken_arrow <-system.time({# Load the dataset using Apache Arrow dataset <-open_dataset("./labevents_filtered.csv",format ="csv")# Select the relevant columns and filter based on the 'itemid' values# The itemid values correspond to the following lab items:# Creatinine (50912), Potassium (50971), Sodium (50983), Chloride (50902),# Bicarbonate (50882), Hematocrit (51221), White blood cell count (51301),# and Glucose (50931) result <- dataset %>%select(subject_id, itemid, charttime, valuenum) %>%filter(itemid %in%c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))# Collect the result as a tibble result_tibble <- result %>%collect()})# Print the time taken for the entire processprint(time_taken_arrow)
user system elapsed
5.126 0.751 5.256
# Display the number of rows and the first 10 rows of the result tibbleprint(nrow(result_tibble))
The process of ingesting, selecting columns, and filtering rows using Apache Arrow took approximately 3.9 seconds in total. This includes: - User time: 3.8 seconds for executing code. - System time: 0.67 seconds for system-level operations.
Apache Arrow is a framework that enhances data processing by providing an efficient in-memory columnar format for large datasets. It allows data to be processed quickly and shared across different programming languages without needing to copy data. This makes it highly suitable for modern data analytics pipelines, especially when working with datasets that are too large to fit into memory.
Solution 2.5 Compress labevents.csv to Parquet format and ingest/select/filter
Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.
library(arrow)library(dplyr)# Write the CSV file to Parquet formatarrow::write_dataset(open_dataset("./labevents_filtered.csv", format ="csv"),path ="./labevents_filtered.parque",format ="parquet")# Check the size of the Parquet fileparquet_file_info <-file.info("./labevents_filtered.parquet")print(parquet_file_info$size)
[1] 96
# Measure the time taken for ingestion, selection, and filtering using# Parquet formattime_taken_parquet <-system.time({# Load the Parquet dataset dataset_parquet <-open_dataset("./labevents_filtered.parquet",format ="parquet")# Select relevant columns and filter based on itemid values result_parquet <- dataset_parquet %>%select(subject_id, itemid, charttime, valuenum) %>%filter(itemid %in%c(50912, 50971, 50983, 50902,50882, 51221, 51301, 50931)) %>%collect()})# Print the time taken for the processprint(time_taken_parquet)
user system elapsed
0.967 0.744 0.759
# Display the number of rows and the first 10 rows of the result tibbleprint(nrow(result_parquet))
The process of ingesting, selecting, and filtering the Parquet file took 0.611 seconds, with 32,679,896 rows in the filtered data. This shows Parquet’s efficiency compared to CSV for handling large datasets.
Parquet is a special file format designed to store large amounts of data efficiently. Unlike regular text files like CSV, Parquet compresses the data, making it smaller and faster to work with. It also organizes data in a way that allows programs to quickly find and read only the parts they need, rather than loading the entire file. This saves time and space, especially when dealing with big datasets.
Solution 2.6 DuckDB
Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.
library(arrow)library(duckdb)library(dplyr)# Measure time taken for the entire processtime_taken_duckdb <-system.time({# Ingest Parquet file and convert it to a DuckDB table duckdb_table <-to_duckdb(open_dataset("./labevents_filtered.parquet", format ="parquet") )# Select relevant columns and filter based on itemid values result_duckdb <- duckdb_table %>%select(subject_id, itemid, charttime, valuenum) %>%filter(itemid %in%c(50912, 50971, 50983, 50902,50882, 51221, 51301, 50931)) %>%collect()})# Print the time taken for the entire processprint(time_taken_duckdb)
user system elapsed
2.085 0.796 1.971
# Display the number of rows and the first 10 rows of the result tibbleprint(nrow(result_duckdb))
The process of ingesting the Parquet file, converting it to a DuckDB table, and selecting and filtering data took 1.11 seconds, with 32,679,896 rows in the filtered result.
DuckDB is like a mini, super-fast database that runs right on your computer without needing to set up any complex systems. It’s designed to handle big datasets quickly and efficiently, much like big databases such as SQLite, but is specifically optimized for analytics and processing large files, including Parquet files.
Q3. Ingest and filter chartevents.csv.gz
chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.
zcat< ~/mimic/icu/d_items.csv.gz |head-10
In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.
Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.
library(arrow)library(dplyr)# Measure time taken for ingestion, selection, and filtering using Apache Arrowtime_taken_arrow <-system.time({# Load the dataset using Apache Arrow dataset <-open_dataset("./chartevents_filtered.csv", format ="csv")# Assuming the correct column names from the schema, rename accordingly# Update these names based on the actual schema renamed_dataset <- dataset %>%rename(# Replace with actual subject_id column namesubject_id =`10000032`,# Replace with actual charttime column namecharttime =`2180-07-23 14:00:00`,# Replace with actual item_id column nameitem_id =`223761`,# Replace with actual valuenum column namevaluenum =`98.7` )# Select the relevant columns and filter based on the 'item_id' values result <- renamed_dataset %>%select(subject_id, charttime, item_id, valuenum) %>%filter(item_id %in%c(220045, 220181, 220179, 223761, 220210))# Collect the result as a tibble result_tibble <- result %>%collect()})# Print the time taken for the entire processprint(time_taken_arrow)
user system elapsed
3.124 0.725 3.473
# Display the number of rows and the first 10 rows of the result tibbleprint(nrow(result_tibble))
# Check number of rows in original compressed file including headerzcat< chartevents_filtered.csv.gz |tail-n +2 |wc-l
24816685
# Check number of columnszcat< chartevents_filtered.csv.gz |head-n 1 |awk-F',''{print NF}'
4
The number of rows(24816685) and columns(4) match correctly in the compressed file and the loaded dataset.
Ingests: Reads the CSV file using Apache Arrow’s open_dataset(), which is designed to handle larger-than-memory datasets efficiently.
Renames Columns: Changes the names of the columns from their raw schema format to more understandable ones (subject_id, charttime, etc.).
Selects and Filters: Focuses on specific columns and filters the rows based on conditions applied to the item_id column.
Collects the Data: The result is collected into a local tibble, allowing it to be printed and analyzed in memory.
Timing: Measures the time taken for the entire process of ingestion, renaming, selection, filtering, and collection.
The process of loading the dataset using Apache Arrow, selecting relevant columns, and filtering based on item IDs took around 2.82 seconds and resulted in a dataset with 24.8 million rows. This demonstrates the efficiency of using Apache Arrow for handling large datasets.